Abstract. Given a string S of length n, its maximal unbordered factor
is the longest factor which does not have a border. In this work we 
investigate the relationship between n and the length of the maximal 
unbordered factor of S. We prove that for the alphabet of size $\sigma \ge 5$ the
expected length of the maximal unbordered factor of a string of length n 
is at least 0.99n (for sufficiently large values of n). As an application 
of this result, we propose a new algorithm for computing the maximal 
unbordered factor of a string. 

1 Introduction 

If a proper prefix of a string is simultaneously its suffix, then it is called a border 
of the string. Given a string S of length n, its maximal unbordered factor is the 
longest factor which does not have a border. The relationship between n and the 
length of the maximal unbordered factor of S has been a subject of interest in 
the literature for a long time, starting from the 1979 paper of Ehrenfeucht and 
Silberger [7]. 

Let $b(S)$ be the length of the maximal unbordered factor of S and $\pi(S)$ be
the minimal period of S. Ehrenfeucht and Silberger showed that if the minimal
period of S is smaller than $\frac{1}{2}n$, then $b(S) = \pi(S)$. Following this, they raised
a natural question: How small must $b(S)$ be to guarantee $b(S) = \pi(S)$? Their
conjecture was that $b(S)$ must be smaller than $\frac{1}{2}n$. However, this conjecture was
proven false two years later by Assous and Pouzet [1]. As a counterexample they
gave a string

$$S = a^m b a^{m+1} b a^m b a^{m+2} b a^m b a^{m+1} b a^m$$

of length $n = 7m + 10$. The length of the maximal unbordered factor of this string
is $b(S) = 3m + 6 < \frac{3}{7}n + 2 < \frac{1}{2}n$ (with $b a^{m+1} b a^m b a^{m+2}$ and $a^{m+2} b a^m b a^{m+1} b$
being unbordered), and the minimal period $\pi(S) = 4m + 7 \neq b(S)$.
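The parts of this counterexample that are easy to check mechanically can be verified with a short script. The sketch below is an illustration only; the helper names `is_unbordered`, `is_period` and `assous_pouzet` are ours and do not come from the cited papers. It confirms the length of S, that $4m + 7$ is a period, and that the two named factors occur in S and are unbordered.

```python
def is_unbordered(w):
    """True if no proper non-empty prefix of w is also a suffix of w."""
    return all(w[:k] != w[-k:] for k in range(1, len(w)))


def is_period(w, p):
    """True if w[i] == w[i + p] for every valid 0-based position i."""
    return all(w[i] == w[i + p] for i in range(len(w) - p))


def assous_pouzet(m):
    """Build a^m b a^(m+1) b a^m b a^(m+2) b a^m b a^(m+1) b a^m."""
    exponents = [m, m + 1, m, m + 2, m, m + 1, m]
    return "b".join("a" * e for e in exponents)


for m in range(1, 4):
    s = assous_pouzet(m)
    # The two unbordered factors of length 3m + 6 named in the text.
    u1 = "b" + "a" * (m + 1) + "b" + "a" * m + "b" + "a" * (m + 2)
    u2 = "a" * (m + 2) + "b" + "a" * m + "b" + "a" * (m + 1) + "b"
    print(m, len(s),
          u1 in s and is_unbordered(u1),
          u2 in s and is_unbordered(u2),
          is_period(s, 4 * m + 7))
```

The script does not prove that $3m + 6$ is the maximum or that $4m + 7$ is the minimal period; it only checks the witnesses given above.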

The next attempt to answer the question was undertaken by Duval [3]: He
improved the bound to $\frac{1}{4}n + \frac{3}{2}$. But the final answer to the question of Ehrenfeucht
and Silberger was given just recently by Holub and Nowotka [10]. They showed
that $b(S) \le \frac{3}{7}n$ implies $b(S) = \pi(S)$, and, as follows from the example of Assous
and Pouzet, this bound is tight.



Therefore, when either $b(S)$ or $\pi(S)$ is small, $b(S) = \pi(S)$. Exploiting this
fact, one can even compute the maximal unbordered factor itself in linear time. 
The key idea is that in this case the maximal unbordered factor is an unbordered 
conjugate of the minimal period of S, and both the minimal period and its 
unbordered conjugate can be found in linear time [15,6]. 

The interesting cases are those where $b(S)$ (and, consequently, $\pi(S)$) is big.
Yet, it is generally believed that they are the most common ones. This is supported
by experimental results shown in Fig. 1 that plots the average difference
between the length n of a string and the length of its maximal unbordered factor.
Guided by the experimental results, we state the following conjecture: 

Conjecture 1. The expected length of the maximal unbordered factor of a string of
length n is $n - O(1)$.



Fig. 1: Average difference between the length n of a string and the length of its
maximal unbordered factor for $1 \le n \le 100$ and alphabets of size $2 \le \sigma \le 5$.
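The quantity plotted in Fig. 1 can be reproduced exactly for very small n by enumerating all strings. The following is a minimal sketch under our own assumptions (exhaustive enumeration, letters encoded as integers, helper names chosen for illustration); it is not the code used for the experiments reported here.

```python
from itertools import product


def border_array(s):
    """b[k] = length of the longest border of the prefix s[:k + 1]."""
    b = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = b[k - 1]
        if s[i] == s[k]:
            k += 1
        b[i] = k
    return b


def muf_length(s):
    """Length of the longest unbordered factor of s."""
    best = 0
    for start in range(len(s)):
        b = border_array(s[start:])
        # The longest unbordered prefix of this suffix ends at the last zero.
        for j in range(len(b) - 1, -1, -1):
            if b[j] == 0:
                best = max(best, j + 1)
                break
    return best


def average_gap(n, sigma):
    """Average of n - (MUF length) over all sigma**n strings of length n."""
    total = sum(n - muf_length(w) for w in product(range(sigma), repeat=n))
    return total / sigma ** n


for n in range(1, 9):
    print(n, average_gap(n, 2))
```

Exhaustive enumeration is only feasible for short strings; for larger n one would sample random strings instead.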


To the best of our knowledge, there have been no attempts to prove the
conjecture or any lower bound at all in the literature. In Section 4 we address
this gap and make the very first step towards proving the conjecture. We show
that the expected length of the maximal unbordered factor of a string of length
n over the alphabet A of size $\sigma \ge 2$ is at least $n(1 - \xi(\sigma) \cdot \sigma^{-4}) + O(1)$, where
$\xi(\sigma)$ is a function that converges to 2 quickly with the growth of $\sigma$. In particular,
this theorem implies that for alphabets of size $\sigma \ge 5$ the expected length of the
maximal unbordered factor of a string is at least 0.99n (for sufficiently large
values of n). To prove the theorem we developed a method of generating strings
with large unbordered factors which we find to be interesting on its own (see
Section 3).

It follows that the algorithm for computing maximal unbordered factors we 
sketched earlier cannot be used in a majority of cases. Instead, one can consider 
the following algorithm. A border array of a string is an array containing the 
maximal length of a border of each prefix of this string. Note that a prefix of a 
string is unbordered exactly when the corresponding entry in the border array 
is zero. Therefore, to compute the maximal unbordered factor of a string S it 
suffices to build border arrays of all suffixes of a string. It is well-known that a 
single border array can be constructed in linear time, which gives quadratic time 
bound for the algorithm. In Section 5 we show how to modify this algorithm to 
make use of the fact that the expected length of the maximal unbordered factor 
is big. We give an $O(n^2/\sigma^4)$ time bound for the modified algorithm, as well as confirm
its efficiency experimentally. 

Related work. Apart from the aforementioned results, we consider our work to 
be related to three areas of research. 

As we have already mentioned, the maximal unbordered factor can be found
by locating the rightmost zeros in the border arrays of suffixes of a string, and
a better understanding of the structure of border arrays would give more efficient
algorithms for the problem. The structure of border arrays has been studied in [9, 8,
5, 4, 14, 2].

In contrast to the problem we consider in this work, one can be interested 
in the problem of preprocessing a string to answer online factor queries related 
to its borders. This problem has been considered by Kociumaka et al. [13,12]. 
They proposed a series of data structures which, in particular, can be used to 
determine if a factor is unbordered in logarithmic time. 

Finally, repeating fragments in a string (borders of factors are one example of
such fragments) were studied in connection with the Longest Common Extension
problem which asks, given a pair of positions i, j in a string, to return the longest
fragment that occurs both at i and j. This problem has many solutions, yet
recently Ilie et al. [11] showed that the simplest solution, i.e. simply scanning
the string and comparing pairs of letters starting at positions i and j, is the
fastest on average. The authors also proved that the longest common extension
has expected length smaller than $\frac{1}{\sigma - 1}$, where $\sigma$ is the size of the alphabet.


2 Preliminaries 

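We start by introducing some standard notation and definitions.

Power sums. We will need the following identities.

Fact 1. $S(x) = \sum_{i=1}^{k} i\,x^{i-1} = \frac{k x^{k+1} - (k+1) x^k + 1}{(x-1)^2}$ for all $x \neq 1$.

Proof.

$$S(x) = \Big(\sum_{i=1}^{k} x^i\Big)' = \Big(\frac{x^{k+1} - x}{x - 1}\Big)' = \frac{((k+1) x^k - 1)(x - 1) - (x^{k+1} - x)}{(x-1)^2}$$

Simplifying, we obtain

$$S(x) = \sum_{i=1}^{k} i\,x^{i-1} = \frac{k x^{k+1} - (k+1) x^k + 1}{(x-1)^2}$$

□

Corollary 1. $S(x) = \sum_{i=1}^{k} i\,x^{i-1} = \frac{k \cdot x^k}{x-1} + O(x^k)$ for $x \ge 1.5$.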
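As a sanity check, the closed form $\frac{k x^{k+1} - (k+1) x^k + 1}{(x-1)^2}$ for $\sum_{i=1}^{k} i\,x^{i-1}$ can be compared with the term-by-term sum using exact rational arithmetic. The sketch below is illustrative (function names are ours); it also checks that the difference between $S(x)$ and $\frac{k x^k}{x-1}$ equals $\frac{1 - x^k}{(x-1)^2}$, which is the $O(x^k)$ term of Corollary 1.

```python
from fractions import Fraction


def power_sum_direct(x, k):
    """S(x) = sum_{i=1..k} i * x**(i-1), computed term by term."""
    return sum(i * x ** (i - 1) for i in range(1, k + 1))


def power_sum_closed(x, k):
    """Closed form of S(x), valid for x != 1."""
    return (k * x ** (k + 1) - (k + 1) * x ** k + 1) / (x - 1) ** 2


for x in (Fraction(3, 2), Fraction(2), Fraction(5)):
    for k in (1, 5, 20):
        assert power_sum_direct(x, k) == power_sum_closed(x, k)
        # Remainder after subtracting the leading term k * x^k / (x - 1).
        gap = power_sum_closed(x, k) - k * x ** k / (x - 1)
        assert gap == (1 - x ** k) / (x - 1) ** 2
        assert abs(gap) <= x ** k / (x - 1) ** 2
```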

Strings. The alphabet A is a finite set of size $\sigma$. We refer to the elements of A as
letters. A string over A is a finite ordered sequence of letters (possibly empty).
Letters in a string are numbered starting from 1, that is, a string S of length n
consists of letters $S[1], S[2], \ldots, S[n]$. The length n of S is denoted by $|S|$. The set
of all strings of length n is denoted $A^n$.

For $1 \le i \le j \le n$, $S[i..j]$ is a factor of S with endpoints i and j. The factor
$S[1..j]$ is called a prefix of S, and the factor $S[i..n]$ is called a suffix of S. A
prefix (or a suffix) different from S and the empty string is called proper.

If a proper prefix of a string is simultaneously its suffix, then it is called a
border. For example, borders of a string ababa are a and aba. The maximal border
of a string is its longest border. For S we define its border array B (also known
as the failure function) to contain the lengths of the maximal borders of all its
prefixes, i.e. $B[i]$ is the length of the maximal border of $S[1..i]$, $i = 1..n$. The
last entry in the border array, $B[n]$, contains the length of the maximal border
of S. It is well-known that the border array and therefore the maximal border
of S can be found in $O(n)$ time and space [15].

A period of S is an integer $\pi$ such that for all i, $1 \le i \le n - \pi$, $S[i] = S[i + \pi]$.
The minimal period of a string has length $n - B[n]$, and hence can be computed
in linear time as well.
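The border array and the minimal period can be computed with the classical failure-function scan. The sketch below is one standard way to do it, not necessarily the formulation of [15]; it uses 0-based Python indexing, so entry `b[k]` corresponds to the entry for the prefix of length k + 1 in the 1-based notation above.

```python
def border_array(s):
    """Return b where b[k] is the length of the longest border of s[:k + 1]."""
    b = [0] * len(s)
    k = 0  # length of the longest border of the prefix processed so far
    for i in range(1, len(s)):
        # Fall back to shorter borders until one can be extended by s[i].
        while k > 0 and s[i] != s[k]:
            k = b[k - 1]
        if s[i] == s[k]:
            k += 1
        b[i] = k
    return b


def minimal_period(s):
    """Minimal period of a non-empty string: its length minus its longest border."""
    return len(s) - border_array(s)[-1]


print(border_array("ababa"))    # borders of ababa are a and aba
print(minimal_period("ababa"))
```

Each letter increases `k` at most once, and every fallback step decreases it, so the scan runs in linear time.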

Unbordered strings. A string is called unbordered if it has no border. Let $b(i, \sigma)$
be the number of unbordered strings in $A^i$. Nielsen [16] showed that unbordered
strings can be constructed in a recursive manner, starting from unbordered
strings of length 2 and inserting new letters in the “middle”. The following
theorem is a corollary of the proposed construction method:

Theorem 1 ([16]). The sequence $b(i, \sigma) \cdot \sigma^{-i}$ is monotonically nonincreasing
and it converges to a constant $\alpha$, which satisfies $\alpha \ge 1 - \sigma^{-1} - \sigma^{-2}$.

Corollary 2 ([16]). $b(i, \sigma) \ge \sigma^i - \sigma^{i-1} - \sigma^{i-2}$ for all i.



This corollary immediately implies that the expected length of the maximal
unbordered factor of a string of length n is at least $n(1 - \sigma^{-1} - \sigma^{-2})$. We improve
this lower bound in the subsequent sections. We will make use of a lower bound
on the number $b_j(i, \sigma)$ of unbordered strings of length i whose first letter differs from
the subsequent j letters. An example of such a string for $j = 2$ is abcacbb.
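For small parameters the number of unbordered strings of length i over an alphabet of size $\sigma$ can be obtained by brute force and compared with the lower bound $\sigma^i - \sigma^{i-1} - \sigma^{i-2}$. The sketch below is a plain enumeration written for illustration (names are ours); it is not Nielsen's recursive construction.

```python
from itertools import product


def is_unbordered(w):
    """True if no proper non-empty prefix of w equals the suffix of the same length."""
    return all(w[:k] != w[len(w) - k:] for k in range(1, len(w)))


def count_unbordered(i, sigma):
    """Number of unbordered strings of length i over an alphabet of size sigma."""
    return sum(1 for w in product(range(sigma), repeat=i) if is_unbordered(w))


for sigma in (2, 3):
    for i in range(2, 9):
        b = count_unbordered(i, sigma)
        lower = sigma ** i - sigma ** (i - 1) - sigma ** (i - 2)
        assert b >= lower
        print(sigma, i, b, lower)
```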

Lemma 1. $b_j(i, \sigma) \ge (\sigma - 1)^{j+1} \sigma^{i-j-1} - \sigma^{i-2}$ for all $i \ge j + 1$.

Proof. The number of such strings is equal to $b(i, \sigma)$ minus the number $\bar{b}_j(i, \sigma)$
of unbordered strings of length i that do not have the property. We estimate the
latter from above by the number of such strings in the set of all strings with
their first letter not equal to the last letter. Hence,
$\bar{b}_j(i, \sigma) \le (\sigma - 1) \sigma^{i-1} - (\sigma - 1)^{j+1} \sigma^{i-j-1}$.
Recall that $b(i, \sigma) \ge \sigma^i - \sigma^{i-1} - \sigma^{i-2}$ by Theorem 1. The claim
follows. □

Remark. The right-hand side of the inequality of Lemma 1 is often negative for
$\sigma = 2$. We will not use it for this case.
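The bound of Lemma 1, read as $b_j(i, \sigma) \ge (\sigma - 1)^{j+1} \sigma^{i-j-1} - \sigma^{i-2}$, can be checked by brute force for a small alphabet. The following sketch counts unbordered strings whose first letter differs from the next j letters; the function names are ours and the enumeration is only meant for tiny parameters.

```python
from itertools import product


def is_unbordered(w):
    """True if no proper non-empty prefix of w equals the suffix of the same length."""
    return all(w[:k] != w[len(w) - k:] for k in range(1, len(w)))


def count_bj(i, j, sigma):
    """Unbordered strings of length i whose first letter differs from the next j letters."""
    count = 0
    for w in product(range(sigma), repeat=i):
        if all(w[0] != w[t] for t in range(1, j + 1)) and is_unbordered(w):
            count += 1
    return count


sigma = 3
for j in range(1, 4):
    for i in range(j + 1, 8):
        bound = (sigma - 1) ** (j + 1) * sigma ** (i - j - 1) - sigma ** (i - 2)
        assert count_bj(i, j, sigma) >= bound
        print(j, i, count_bj(i, j, sigma), bound)
```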

The maximal unbordered factor of a string (MUF) is naturally defined to be 
the longest factor of the string which is unbordered. 

3 Generating strings with large MUF 

In this section we explain how to generate strings of some fixed length n with 
large maximal unbordered factors. To show the lower bounds we announced, we 
will need many of such strings. The idea is to generate them from unbordered 
strings. 

Let S be an unbordered string of length $i \ge \lceil n/2 \rceil$. Consider a string $S P_1 \ldots P_k$
of length n, where $P_1, \ldots, P_k$ are prefixes of S. It is not difficult to see that the
maximal unbordered factor of any string of this form has length at least i
(because S is one of its unbordered factors). The number of such strings that
can be generated from S is $2^{n-i-1}$, because each of them corresponds to a
composition of $n - i$, i.e. a representation of $n - i$ as a sum of a sequence of strictly
positive integers. However, some of these strings can be equal. Consider, for example,
an unbordered string S = aaabab. Then the two strings aaababaaa (S appended
with its prefix aaa) and aaababaaa (S appended with its prefixes a and aa) will
be equal. However, we can show the following lemma.
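The correspondence between compositions and appended prefixes, and the possibility of collisions, can be seen on the example S = aaabab with n = 9. The sketch below is illustrative; `compositions` and `extensions` are names introduced here.

```python
def compositions(total):
    """Yield every composition of total as a tuple of positive integers."""
    if total == 0:
        yield ()
        return
    for first in range(1, total + 1):
        for rest in compositions(total - first):
            yield (first,) + rest


def extensions(s, n):
    """All strings S P_1 ... P_k of length n, one per composition of n - len(s).

    Assumes len(s) >= n / 2, so that every part is a valid prefix length.
    """
    return [s + "".join(s[:p] for p in comp)
            for comp in compositions(n - len(s))]


exts = extensions("aaabab", 9)
print(len(exts), sorted(set(exts)))
```

For this S there is one generated string per composition of 3, yet all of them coincide, because every prefix of length at most 3 consists of the letter a only.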

Lemma 2. Let $S_1 \neq S_2$ be two unbordered strings. Any two strings of the form
above generated from $S_1$ and $S_2$ are distinct.

Proof. Suppose that the produced strings are equal. If $|S_1| = |S_2|$, we immediately
obtain $S_1 = S_2$, a contradiction. Otherwise, w.l.o.g. assume $|S_1| < |S_2|$.
Then $S_2$ is equal to a concatenation of $S_1$ and some of its prefixes. The last of
these prefixes is simultaneously a suffix and a prefix of $S_2$, i.e. $S_2$ is not unbordered.
A contradiction. □



Our idea is to produce as many strings of the form $S P_1 \ldots P_k$ as possible,
taking extra care to ensure that all strings produced from a fixed string S are
distinct. From unbordered strings of length $i = n$ and $i = n - 1$ we produce just
one string of length n. (For $i = n$ it is the string itself and for $i = n - 1$ it is the
string appended with its first letter.) For unbordered strings of length $i \le n - 2$
we propose a different method based on the lemma below.

Lemma 3. Each unbordered string S of length i such that its first letter differs
from the subsequent j letters, where $\lceil n/2 \rceil \le i < n - j$, gives at least $2^j$ distinct
strings of the form $S P_1 \ldots P_k$.

Proof. We choose the last prefix $P_k$ to be the prefix of S of length at least
$n - i - j$. We place no restrictions on the first $k - 1$ prefixes.

Let us start by showing that all generated strings are distinct. Suppose there
are two equal strings $S P_1 \ldots P_\ell$ and $S P'_1 \ldots P'_{\ell'}$. Let $P_d$, $P'_d$ be the first pair of
prefixes that have different lengths. W.l.o.g. assume that $|P_d| < |P'_d|$. Then $d \neq \ell$
and hence $|P_d| \le j = n - i - (n - i - j)$. It follows that $P'_d$ (which is a prefix
of S) contains at least two occurrences of $S[1]$, one at the position 1 and one at
the position $|P_d| + 1 \le j + 1$. In other words, we have $S[1] = S[|P_d| + 1]$ and
$1 < |P_d| + 1 \le j + 1$, which contradicts our choice of S.

If the length of the last prefix is fixed to some integer $m \ge n - i - j$, then each
of the generated strings $S P_1 \ldots P_k$ is defined by the lengths of the first $k - 1$ of the
appended prefixes. In other words, there is one-to-one correspondence between
the generated strings and compositions of $n - i - m$. (Here we use $i \ge \lceil n/2 \rceil$
to ensure that every composition corresponds to a sequence of prefixes of S.)
The number of compositions of $n - i - m$ is 1 when $m = n - i$ and $2^{n-i-m-1}$
otherwise. Summing up for all m from $n - i - j$ to $n - i$ we obtain that the
number of the generated strings is $2^j$. □
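The construction in the proof (last prefix of length at least $n - i - j$, no restriction on the earlier ones) can be tried on the example S = abcacbb, whose first letter differs from the next j = 2 letters. The sketch below is illustrative; `lemma3_strings` is a name introduced here, and n = 11, 12 are arbitrary choices satisfying $\lceil n/2 \rceil \le i < n - j$.

```python
def compositions(total):
    """Yield every composition of total as a tuple of positive integers."""
    if total == 0:
        yield ()
        return
    for first in range(1, total + 1):
        for rest in compositions(total - first):
            yield (first,) + rest


def lemma3_strings(s, j, n):
    """Strings S P_1 ... P_k of length n whose last prefix has length >= n - |S| - j."""
    i = len(s)
    rest = n - i
    assert 2 * i >= n and rest - j >= 1, "requires ceil(n/2) <= i < n - j"
    out = []
    for m in range(rest - j, rest + 1):        # length of the last prefix P_k
        for comp in compositions(rest - m):    # lengths of P_1, ..., P_{k-1}
            out.append(s + "".join(s[:p] for p in comp) + s[:m])
    return out


for n in (11, 12):
    generated = lemma3_strings("abcacbb", 2, n)
    # Expect 2**j generated strings, all of them distinct.
    print(n, len(generated), len(set(generated)))
```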

Let us estimate the total number of strings produced by this method. We
produce one string from each unbordered string of length i. Then, from each
unbordered string of length i such that its first letter differs from the second
letter, we produce $1 = 2 - 1$ more string. If the first letter differs both from
the second and the third letters, we produce $2 = 2^2 - 1 - 1$ more strings. And
finally, if the first letter differs from the subsequent j letters, we produce
$2^{j-1} = 2^j - (1 + 1 + 2 + \ldots + 2^{j-2})$ more strings. It follows that the number of strings we can
produce from unbordered strings of length $i \le n - 2$ is

$$b(i, \sigma) + \sum_{j=1}^{n-i-1} 2^{j-1} \cdot b_j(i, \sigma)$$

Recall that the maximal unbordered factor of each of the generated strings has
length at least i and that none of them can be equal to a string generated from
an unbordered string of different length.

4 Expected length of MUF 

In this section we prove the main result of this paper. 



Theorem 2. The expected length of the maximal unbordered factor of a string of
length n over an alphabet A of size $\sigma \ge 2$ is at least

$$n \cdot (1 - \xi(\sigma) \cdot \sigma^{-4}) + O(1) \qquad (1)$$

where $\xi(2) = 8$ and $\xi(\sigma) = \frac{2\sigma^3 - 2\sigma^2}{(\sigma - 2)(\sigma^2 - 2\sigma + 2)}$ for $\sigma > 2$.

Before we give a proof of the theorem, let us say a few words about $\xi(\sigma)$.
This function is monotonically decreasing for $\sigma \ge 2$ and quickly converges to 2.
We give the first four values for $\xi(\sigma)$ (rounded up to 3 decimal places) and
$1 - \xi(\sigma) \cdot \sigma^{-4}$ (rounded down to 3 decimal places) in the table below.



$\sigma$                                  2        3        4        5
$\xi(\sigma)$                             8.000    7.200    4.800    3.922
$1 - \xi(\sigma) \cdot \sigma^{-4}$       0.500    0.911    0.981    0.993
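The tabulated values can be recomputed exactly. The sketch below assumes the closed form $\xi(\sigma) = \frac{2\sigma^3 - 2\sigma^2}{(\sigma - 2)(\sigma^2 - 2\sigma + 2)}$ for $\sigma > 2$ together with $\xi(2) = 8$; this form reproduces the table entries, and the function names are ours.

```python
from fractions import Fraction


def xi(sigma):
    """xi(2) = 8; otherwise the closed form assumed in the lead-in."""
    if sigma == 2:
        return Fraction(8)
    return Fraction(2 * sigma ** 3 - 2 * sigma ** 2,
                    (sigma - 2) * (sigma ** 2 - 2 * sigma + 2))


def density(sigma):
    """The factor 1 - xi(sigma) * sigma**(-4) multiplying n in the lower bound."""
    return 1 - xi(sigma) / sigma ** 4


for sigma in (2, 3, 4, 5, 10, 100):
    print(sigma, float(xi(sigma)), float(density(sigma)))
```

Exact fractions avoid any rounding doubt when comparing with the threshold 0.99 for $\sigma = 5$.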


Corollary 3. The expected length of the maximal unbordered factor of a string of
length n over the alphabet A of size $\sigma \ge 5$ is at least 0.99n (for sufficiently large
values of n).


Proof of Theorem 2. Let $\beta_i^n(\sigma)$ be the number of strings in $A^n$ such that the
length of their maximal unbordered factor is i. The expected length of the maximal
unbordered factor is then equal to

$$\frac{1}{\sigma^n} \sum_{i=1}^{n} i \cdot \beta_i^n(\sigma)$$

For the sake of simplicity, we temporarily omit $\frac{1}{\sigma^n}$ and only in the very end we
will add it back. Recall that in the previous section we showed how to generate
a set of distinct strings of length n with maximal unbordered factors of length
at least i which contains

$$b(i, \sigma) + \sum_{j=1}^{n-i-1} 2^{j-1} \cdot b_j(i, \sigma)$$

strings for all $\lceil n/2 \rceil \le i \le n - 2$ and $b(i, \sigma)$ strings for $i \in \{n - 1, n\}$. Then

$$\sum_{i=1}^{n} i \cdot \beta_i^n(\sigma) \ge \underbrace{\sum_{i=\lceil n/2 \rceil}^{n} i \cdot b(i, \sigma)}_{(S_1)} + \underbrace{\sum_{i=\lceil n/2 \rceil}^{n-2} \sum_{j=1}^{n-i-1} i \cdot 2^{j-1} \cdot b_j(i, \sigma)}_{(S_2)} \qquad (2)$$



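We start by computing $(S_1)$. Applying Corollary 2 and replacing $b(i, \sigma)$ with
its lower bound $b(n, \sigma) / \sigma^{n-i}$ (valid for $i \le n$ by the monotonicity stated in
Theorem 1) in $(S_1)$, we obtain:

$$\sum_{i=\lceil n/2 \rceil}^{n} i \cdot b(i, \sigma) \ge \frac{b(n, \sigma)}{\sigma^{n-1}} \sum_{i=\lceil n/2 \rceil}^{n} i \cdot \sigma^{i-1}$$

Note that the lower limit in the inner sum of $(S_1)$ can be replaced by one because
the correcting term is small:

$$\frac{b(n, \sigma)}{\sigma^{n-1}} \sum_{i=1}^{\lceil n/2 \rceil - 1} i \cdot \sigma^{i-1} \le \frac{n^2 \cdot b(n, \sigma)}{4 \sigma^{n/2}} = O(\sigma^n)$$

We finally use Corollary 1 for $x = \sigma$ and $k = n$ to compute the right-hand side
of the inequality:

$$(S_1) \ge \frac{n \sigma}{\sigma - 1} \cdot b(n, \sigma) + O(\sigma^n) \qquad (3)$$

We note that for $\sigma = 2$ the right-hand side is at least
$2n \cdot (2^n - 2^{n-1} - 2^{n-2}) + O(2^n) = n \cdot 2^{n-1} + O(2^n)$ by Corollary 2 and $(S_2) \ge 0$.
Hence, $\sum_{i=1}^{n} i \cdot \beta_i^n(2) \ge n \cdot 2^{n-1} + O(2^n)$. Dividing both sides by $2^n$, we obtain the theorem.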

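Below we assume $\sigma > 2$ and for these values of $\sigma$ give a better lower bound
on $(S_2)$. Recall that $b_j(i, \sigma) \ge (\sigma - 1)^{j+1} \sigma^{i-j-1} - \sigma^{i-2}$ (see Lemma 1). It follows
that

$$(S_2) \ge \sum_{i=\lceil n/2 \rceil}^{n-2} \sum_{j=1}^{n-i-1} 2^{j-1} \cdot i \cdot ((\sigma - 1)^{j+1} \sigma^{i-j-1} - \sigma^{i-2})$$

Let us change the order of summation:

$$(S_2) \ge \sum_{j=1}^{\lfloor n/2 \rfloor - 1} 2^{j-1} \cdot ((\sigma - 1)^{j+1} \sigma^{-j} - \sigma^{-1}) \sum_{i=\lceil n/2 \rceil}^{n-j-1} i \cdot \sigma^{i-1}$$

We can replace the lower limit in the inner sum of $(S_2)$ by one as it will only
change the sum by $O(\sigma^n)$. After replacing the lower limit, we apply Corollary 1
to compute the inner sum:

$$(S_2) \ge \sum_{j=1}^{\lfloor n/2 \rfloor - 1} 2^{j-1} \cdot ((\sigma - 1)^{j+1} \sigma^{-j} - \sigma^{-1}) \cdot (n - j - 1) \frac{\sigma^{n-j-1}}{\sigma - 1} + O(\sigma^n)$$

We divide the sum above into positive and negative parts:

$$\underbrace{\sum_{j=1}^{\lfloor n/2 \rfloor - 1} (n - j - 1)\, 2^{j-1} (\sigma - 1)^{j} \sigma^{n-2j-1}}_{(P)} - \underbrace{\sum_{j=1}^{\lfloor n/2 \rfloor - 1} (n - j - 1)\, 2^{j-1} \frac{\sigma^{n-j-2}}{\sigma - 1}}_{(N)}$$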



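We start by computing (N). We again apply the trick with the lower limit and
Fact 1, and replace $(n - j - 1)$ with k:

$$(N) = \frac{2^{n-3}}{\sigma - 1} \sum_{k=\lceil n/2 \rceil}^{n-2} k \Big(\frac{\sigma}{2}\Big)^{k-1} = n \cdot \frac{\sigma^{n-2}}{(\sigma - 1)(\sigma - 2)} + O(\sigma^n)$$

Computing (P) is a bit more involved. We divide it into two parts:

$$(P) = \underbrace{\frac{(n-1)\sigma^{n-1}}{2} \sum_{j=1}^{\lfloor n/2 \rfloor - 1} \Big(\frac{2(\sigma-1)}{\sigma^2}\Big)^{j}}_{(R_1)} - \underbrace{\frac{\sigma^{n-1}}{2} \sum_{j=1}^{\lfloor n/2 \rfloor - 1} j \Big(\frac{2(\sigma-1)}{\sigma^2}\Big)^{j}}_{(R_2)}$$

$(R_1)$ is a sum of a geometric progression and it is equal to

$$\frac{(n-1)\sigma^{n-1}}{2} \cdot \Big(\frac{1}{1 - 2(\sigma-1)/\sigma^2} - 1\Big) + O(\sigma^n) = \frac{(n-1)(\sigma-1)\sigma^{n-1}}{\sigma^2 - 2\sigma + 2} + O(\sigma^n)$$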


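Lemma 4. $(R_2) = O(\sigma^n)$.

Proof. We start our proof by rewriting $(R_2)$:

$$(R_2) = \sigma^{n-3} (\sigma - 1) \cdot \sum_{j=1}^{\lfloor n/2 \rfloor - 1} j \Big(\frac{2(\sigma-1)}{\sigma^2}\Big)^{j-1}$$

We apply Fact 1 for $x = \frac{2(\sigma-1)}{\sigma^2}$ and $k = \lfloor n/2 \rfloor - 1$ to compute the inner sum.
Since $0 < x < 1$, the inner sum is at most

$$\frac{1}{\big(\frac{2(\sigma-1)}{\sigma^2} - 1\big)^2} = O(1)$$

The claim follows. □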

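We now summarize our findings. From equations for (P), (N), $(R_1)$, and
$(R_2)$ we obtain (after simplification):

$$(S_2) \ge (P) - (N) = n \cdot \Big(\frac{(\sigma-1)\sigma^{n-1}}{\sigma^2 - 2\sigma + 2} - \frac{\sigma^{n-2}}{(\sigma-1)(\sigma-2)}\Big) + O(\sigma^n) \qquad (4)$$

We now return to Equation (2) and use our lower bounds for $(S_1)$ and
$(S_2)$ together with Corollary 2 for $b(n, \sigma)$:

$$\sum_{i=1}^{n} i \cdot \beta_i^n(\sigma) \ge n \cdot \Big(\frac{\sigma(\sigma^n - \sigma^{n-1} - \sigma^{n-2})}{\sigma - 1} + \frac{(\sigma-1)\sigma^{n-1}}{\sigma^2 - 2\sigma + 2} - \frac{\sigma^{n-2}}{(\sigma-1)(\sigma-2)}\Big) + O(\sigma^n)$$

We now simplify the expression above and restore the factor $\frac{1}{\sigma^n}$, as we promised in
the very beginning of the proof, to obtain:

$$\frac{1}{\sigma^n} \sum_{i=1}^{n} i \cdot \beta_i^n(\sigma) \ge n \cdot (1 - \xi(\sigma) \cdot \sigma^{-4}) + O(1) \qquad (5)$$

where $\xi(\sigma) = \frac{2\sigma^3 - 2\sigma^2}{(\sigma - 2)(\sigma^2 - 2\sigma + 2)}$. This completes the proof of Theorem 2. □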



Remark. Theorem 2 actually provides a lower bound on the expected length 
of the maximal unbordered prefix (rather than that of the maximal unbordered 
factor), which suggests that this bound could be improved. 

5 Computing MUF 

Based on our findings we propose an algorithm for computing the maximal 
unbordered factor of a string S of length n and give an upper bound on its 
expected running time. A basic algorithm would be to compute the border arrays 
(see Section 2 for the definition) of all suffixes of S. The border arrays contain 
the lengths of the maximal borders of all prefixes of all suffixes of S, i.e., of all 
factors of S. It remains to scan the border arrays and to select the longest factor 
such that the length of its maximal border is zero. Since a border array can be 
computed in linear time, the running time of this algorithm is $O(n^2)$.
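The basic quadratic algorithm can be sketched as follows: build the border array of every suffix and keep the longest prefix whose entry is zero. This is an illustrative Python rendering, not the authors' implementation from the repository; the function names are ours and indexing is 0-based.

```python
def border_array(s):
    """b[k] = length of the longest border of the prefix s[:k + 1]."""
    b = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = b[k - 1]
        if s[i] == s[k]:
            k += 1
        b[i] = k
    return b


def muf_basic(s):
    """Longest unbordered factor of s, examining the border array of every suffix."""
    best_start, best_len = 0, 0
    for start in range(len(s)):
        b = border_array(s[start:])
        # Scan from the right for the last zero: the longest unbordered prefix.
        for j in range(len(b) - 1, -1, -1):
            if b[j] == 0:
                if j + 1 > best_len:
                    best_start, best_len = start, j + 1
                break
    return s[best_start:best_start + best_len]


print(muf_basic("ababa"), muf_basic("aaabab"))
```

When several factors share the maximal length, this sketch returns the leftmost one.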

The algorithm we propose is a minor modification of the basic algorithm. We
build border arrays for suffixes of S starting from the longest one. After building
an array $B_i$ for $S[i..n]$ we scan it and locate the longest factor $S[i..j]$ such that
the length of its maximal border stored in $B_i[j]$ is zero. We then compare $S[i..j]$
and the current maximal unbordered factor (initialized with an empty string). If
$S[i..j]$ is longer, we update the maximal unbordered factor and proceed. At the
moment we reach a suffix shorter than the current maximal unbordered factor,
we stop.
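The modification amounts to one extra test at the top of the loop over suffixes. The sketch below illustrates the idea in Python under the same assumptions as before (0-based indexing, our own function names); it is not the authors' source code.

```python
def border_array(s):
    """b[k] = length of the longest border of the prefix s[:k + 1]."""
    b = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = b[k - 1]
        if s[i] == s[k]:
            k += 1
        b[i] = k
    return b


def muf_early_stop(s):
    """Longest unbordered factor of s; stops once suffixes become too short."""
    n = len(s)
    best_start, best_len = 0, 0
    for start in range(n):
        # A suffix shorter than the current best cannot contain a longer factor.
        if n - start < best_len:
            break
        b = border_array(s[start:])
        # b[0] is always 0, so a zero entry exists; take the rightmost one.
        j = max(idx for idx, value in enumerate(b) if value == 0)
        if j + 1 > best_len:
            best_start, best_len = start, j + 1
    return s[best_start:best_start + best_len]


print(muf_early_stop("ababa"), muf_early_stop("aaabab"))
```

If the whole string is unbordered, the loop builds a single border array for the full string; the remaining iterations work on ever shorter suffixes until the stopping test fires.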

Theorem 3. The maximal unbordered factor of a string of length n over an
alphabet A of size $\sigma$ can be found in $O(n^2/\sigma^4)$ expected time.

Proof. Let $b(S)$ be the length of the maximal unbordered factor of S. Then the
running time of the algorithm is $O((n - b(S)) \cdot n)$, because the maximal unbordered
factor will be a prefix of one of the first $n - b(S) + 1$ suffixes of S (starting from the longest one).
Averaging this bound over all strings of length n, we obtain that the expected
running time is

$$O\Big(\frac{1}{\sigma^n} \sum_{S \in A^n} n \cdot (n - b(S))\Big) = O\Big(n \cdot \frac{1}{\sigma^n} \sum_{S \in A^n} (n - b(S))\Big)$$

and $\frac{1}{\sigma^n} \sum_{S \in A^n} (n - b(S)) = O(n/\sigma^4)$, as follows from Theorem 2 and properties
of $\xi(\sigma)$. □

We performed a series of experiments to confirm that the expected running
time of the proposed algorithm is much smaller than that of the basic
algorithm. We compared the time required by the algorithms for strings of
length $1 \le n \le 100$ over alphabets of size $\sigma \in \{2, 3, 4, 5, 10\}$. The time required
by the algorithms was computed as the average time on a set of size
N = 10® of randomly generated strings of given length. The experiments were 
performed on a PC equipped with one 2.6 GHz Intel Core i5 processor. As can
be seen in Fig. 2, the minor modification we proposed decreases the expected
running time dramatically. The results were similar for all considered alphabet
sizes. All source files, results, and plots can be found in a repository
http://github.com/avlonger/unbordered.



Fig. 2: Average running times of the proposed algorithm (dashed line) and the 
basic algorithm (solid line) for strings over the alphabet of size $\sigma = 2$.


We note that the data structures [13,12] can be used to compute the maximal
unbordered factor in a straightforward way by querying all factors in order
of decreasing length. This idea seems to be very promising since these data
structures need to be built just once, for the string S itself. However, the data
structures are rather complex and both the theoretical bound for the expected
running time, which is $O(\frac{n^2}{\sigma^4} \log n)$, and our experiments show that this solution
is slower than the one described above.

6 Conclusion 

We consider the contributions of this work to be three-fold. We started with an 
explicit method of generating strings with large unbordered factors. We then 
used it to show that the expected length of the maximal unbordered factor and 
the minimal period of a string of length n is $\Omega(n)$, leaving the question raised
in Conjecture 1 open. As an immediate application of our result, we gave a new 
algorithm for computing maximal unbordered factors and proved its efficiency 
both theoretically and experimentally. 